Red Wine Data Exploration by Shivam Bhardwaj
Which properties contribute to a good red wine? In this project we try to answer this question by exploring the red wine data set.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. :0.900
## 1st Qu.: 7.100 1st Qu.:0.3950 1st Qu.:0.0900 1st Qu.:1.900
## Median : 7.900 Median :0.5200 Median :0.2500 Median :2.200
## Mean : 8.259 Mean :0.5288 Mean :0.2661 Mean :2.409
## 3rd Qu.: 9.100 3rd Qu.:0.6400 3rd Qu.:0.4200 3rd Qu.:2.600
## Max. :13.200 Max. :1.5800 Max. :1.0000 Max. :8.300
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 21.25
## Median :0.07900 Median :13.00 Median : 37.00
## Mean :0.08699 Mean :15.17 Mean : 44.52
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 60.00
## Max. :0.61100 Max. :46.00 Max. :144.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9967 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.316 Mean :0.6569 Mean :10.43
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7275 3rd Qu.:11.10
## Max. :1.0029 Max. :4.010 Max. :2.0000 Max. :14.00
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
## 'data.frame': 1534 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Univariate Plots Section
The original red wine data set contains 1,599 observations of 11 chemical properties of the wine, plus a quality score.
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
Quality Distribution
Wine quality is a discrete grade ranging from 3 to 8, with a median of 6.

Distribution of Other Chemical Properties

## bad average excellent
## 62 1264 208
Univariate Analysis
Some observations on the distributions of the chemical properties can be made:
Normal: Volatile acidity, Density, pH
Positively Skewed: Fixed acidity, Citric acid, Free sulfur dioxide, Total sulfur dioxide, Sulphates, Alcohol
Long Tail: Residual sugar, Chlorides
Rescale Variable
Skewed and long-tailed data can be made more nearly normal by taking the square root or the logarithm. Taking sulphates as an example, we compare the original, square-root, and log versions of the feature.
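The effect of these transformations can be checked numerically as well as visually. The sketch below is only an illustration: it uses the first ten sulphates values printed in the `str()` output rather than the full data set, and `skewness()` is a hypothetical helper written here, not a function from the original analysis.

```r
# Toy data: the first ten sulphates values shown in the str() output above.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80)

# Moment-based sample skewness (hypothetical helper for this sketch).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Skewness of the original, square-root and log10 versions of the variable.
skews <- c(
  original = skewness(sulphates),
  sqrt     = skewness(sqrt(sulphates)),
  log10    = skewness(log10(sulphates))
)
print(round(skews, 3))

# Side-by-side histograms; only drawn in an interactive session.
if (interactive()) {
  op <- par(mfrow = c(1, 3))
  hist(sulphates, main = "original")
  hist(sqrt(sulphates), main = "square root")
  hist(log10(sulphates), main = "log10")
  par(op)
}
```

A skewness closer to zero suggests a more symmetric distribution; on the full data the same comparison would be run on the `sulphates` column of the data frame.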

What is the structure of your dataset?
There are 1,534 observations after slicing out the top 1% of values from the variables that had large outliers (fixed acidity, residual sugar, total sulfur dioxide, and free sulfur dioxide).
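One way such a trim might be done is sketched below. This is an assumption about the approach, not the original code: `trim_top()` is a hypothetical helper, the strict `<` comparison is a guess, and the data are only the first ten rows printed in the `str()` output.

```r
# Toy data: first ten rows of the four trimmed variables, from the str() output above.
wine <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  residual.sugar       = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Hypothetical helper: keep rows strictly below the p-th quantile of every
# listed variable. Thresholds are computed on the untrimmed data.
trim_top <- function(df, vars, p = 0.99) {
  keep <- rep(TRUE, nrow(df))
  for (v in vars) {
    keep <- keep & df[[v]] < quantile(df[[v]], p, names = FALSE)
  }
  df[keep, , drop = FALSE]
}

trimmed <- trim_top(wine, names(wine))
print(nrow(trimmed))
```

On this toy sample the largest value of each variable is dropped; on the full data set the same idea removes only the extreme top 1% of each variable.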
What is/are the main feature(s) of interest in your dataset?
Quality is the main feature. I want to determine what makes a wine taste good or bad.
What other features in the dataset do you think will help support your analysis?
Did you create any new variables from existing variables in the dataset?
Yes, I created a rating variable derived from quality, with three distinct categories: (bad: 3,4), (average: 5,6), (excellent: 7,8)
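A rating variable of this kind can be built with `cut()`. The sketch below assumes the bins are 3-4 for bad, 5-6 for average and 7-8 for excellent (quality ranges from 3 to 8 in the summary), and uses only the first ten quality scores from the `str()` output.

```r
# Toy data: the first ten quality scores shown in the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Intervals are right-closed: (2,4] -> bad, (4,6] -> average, (6,8] -> excellent.
rating <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("bad", "average", "excellent"),
  ordered_result = TRUE
)

print(table(rating))
```

Making the factor ordered keeps the categories in a sensible order in tables and plots.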
Bivariate Plots Section
A plot matrix was used to get a first glance at the data. We are interested in the correlation between wine quality and each chemical property.
The four factors most strongly correlated with wine quality (absolute correlation coefficient greater than 0.2) are:
| Variable | Correlation with quality |
|---|---|
| alcohol | 0.49 |
| volatile.acidity | -0.39 |
| sulphates | 0.256 |
| citric.acid | 0.223 |
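A ranking like this can be obtained by correlating each variable with quality and sorting by absolute value. The sketch below uses only the first ten rows from the `str()` output, so its coefficients will not match the ones reported for the full data set; the 0.2 cut-off is the one mentioned in the text.

```r
# Toy data: first ten rows of selected columns, from the str() output above.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation of every predictor with quality.
predictors <- setdiff(names(wine), "quality")
cors <- sapply(predictors, function(v) cor(wine[[v]], wine$quality))

# Rank by absolute strength and keep those above the 0.2 cut-off.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
strong <- ranked[abs(ranked) > 0.2]

print(round(ranked, 3))
```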

From the above table and plot matrix we see that “fixed.acidity”, “volatile.acidity” and “pH” have some correlation with “citric.acid”. Interestingly, density has some correlation with “fixed.acidity” and “alcohol”. Also, “quality” has some correlation with “alcohol”.
To see if the data makes sense chemically, I first plot pH against fixed acidity. The correlation coefficient is -0.68, meaning that pH tends to drop as fixed acidity increases, which makes sense.

## [1] -0.6794406
The correlation between citric acid and pH is slightly weaker, at -0.53. This adds up, as citric acid is one component of fixed acidity.

## [1] -0.5283267
Volatile acidity (acetic acid) seems to increase when pH level increases. The correlation coefficient was 0.24 indicating some positive correlation.

## [1] 0.2387919
I want to further explore alcohol, volatile acidity, citric acid, and sulphates, which all had correlation coefficients with quality greater than 0.2 in absolute value, along with pH, and see how they relate to the quality of the wine. Box plots are used, and we compare medians because the median is a more robust measure of the center of each group than the mean. As expected, the medians follow the same direction as the correlation coefficients. The boxplots reveal an interesting fact about alcohol: alcohol content is noticeably higher for excellent wines than for bad or average wines. Sulphates and citric acid also seem to be positively correlated with quality, and volatile acidity appears to be negatively correlated.

## redwine$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.302 3.380 3.385 3.500 3.900
## --------------------------------------------------------
## redwine$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.870 3.210 3.310 3.315 3.402 4.010
## --------------------------------------------------------
## redwine$rating: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.280 3.295 3.380 3.780

## redwine$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.20 10.98 13.10
## --------------------------------------------------------
## redwine$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.50 10.00 10.26 10.90 14.00
## --------------------------------------------------------
## redwine$rating: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.50 10.80 11.60 11.54 12.22 14.00

## redwine$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.5800 0.6800 0.7306 0.8838 1.5800
## --------------------------------------------------------
## redwine$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.4100 0.5400 0.5386 0.6400 1.3300
## --------------------------------------------------------
## redwine$rating: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3100 0.3700 0.4090 0.4925 0.9150

## redwine$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0200 0.0750 0.1713 0.2675 1.0000
## --------------------------------------------------------
## redwine$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2400 0.2538 0.4000 0.7600
## --------------------------------------------------------
## redwine$rating: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3000 0.3950 0.3687 0.4900 0.7600

## redwine$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4925 0.5600 0.5927 0.6000 2.0000
## --------------------------------------------------------
## redwine$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3700 0.5400 0.6100 0.6457 0.7000 1.9800
## --------------------------------------------------------
## redwine$rating: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7444 0.8200 1.3600
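Per-rating summaries like those above can be produced with `by()` or `tapply()`. The sketch below is a minimal illustration on the first ten rows from the `str()` output, with the rating bins assumed to be 3-4, 5-6 and 7-8. This small sample contains no bad wines, so that group comes out empty.

```r
# Toy data: first ten alcohol and quality values, from the str() output above.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
wine$rating <- cut(wine$quality, breaks = c(2, 4, 6, 8),
                   labels = c("bad", "average", "excellent"))

# Median alcohol per rating; empty groups give NA.
medians <- tapply(wine$alcohol, wine$rating, median)
print(medians)

# Full five-number summaries for the groups that are present.
rating_used <- droplevels(wine$rating)
print(by(wine$alcohol, rating_used, summary))

# Box plot of the same grouping; only drawn in an interactive session.
if (interactive()) {
  boxplot(alcohol ~ rating, data = wine, ylab = "alcohol (% by volume)")
}
```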
Of the variables examined here, none is strongly related to alcohol: the highest is pH, with a correlation coefficient of 0.22. Alcohol and quality, however, have a correlation coefficient of 0.49, which makes alcohol worth pursuing further.
It appears that when citric acid is in higher amounts, sulphates are as well. The freshness from the citric acid and the antimicrobial effects of the sulphates are likely correlated. The correlation coefficient was 0.33 which indicates weak correlation, but still noteworthy.

## [1] 0.3302825
When graphing volatile acidity against citric acid, there is clearly a negative correlation between the two. It seems that fresher wines tend to avoid acetic acid. The correlation coefficient was -0.56, indicating that larger amounts of citric acid went with smaller amounts of volatile acidity. Since volatile acidity is essentially acetic acid, the winemakers would likely not put large amounts of both acids in the wine, leading them to choose one or the other.

## [1] -0.5629224
There is no particularly striking relationship between alcohol and pH – a weak positive correlation of 0.22.

## [1] 0.2166557
Bivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
It appears that when citric acid is present in higher amounts, sulphates are as well. The freshness from the citric acid and the antimicrobial effects of the sulphates are likely correlated. Volatile acidity and citric acid are negatively correlated. It is likely that fresher wines avoid the sharp, vinegar-like taste of acetic acid. Citric acid and pH were also negatively correlated, which is expected since a lower pH indicates a higher acidity. pH and alcohol are very weakly correlated. Pure alcohol (100%) has a pH of 7.33, so a higher alcohol content would be expected to raise the pH ever so slightly.
The boxplots reveal an interesting picture as well:
- The median for sulphates increased with each rating category. The biggest jump was from average to excellent, with a median of approximately 0.74 for excellent and 0.61 for average.
- Citric acid had the highest concentration in excellent wines. The median rose fairly evenly across the rating categories, with medians of 0.075 for bad, 0.24 for average, and 0.395 for excellent.
- Median volatile acidity fell as the rating improved, with medians of 0.68 for bad, 0.54 for average, and 0.37 for excellent. It's possible that past a certain threshold, the acetic acid became too harsh for the tasters.
- The median alcohol content (10%) was the same whether the wine was bad or average. However, for the excellent wines, the median alcohol content was 11.6%. This leads to a striking observation: a higher alcohol content may lift a wine from average to excellent, but other factors must be at play in making a wine taste bad.
- pH didn't change much between the ratings, with medians of 3.38 for bad, 3.31 for average, and 3.28 for excellent.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
Volatile acidity and citric acid were negatively correlated, as were citric acid and pH. Fixed acidity and pH were negatively correlated, due to the lower pH/more acidic effect.
What was the strongest relationship you found?
From the variables analyzed, the strongest relationship was between fixed acidity and pH, with a correlation coefficient of -0.679, followed by citric acid and volatile acidity, with a correlation coefficient of -0.563.
Multivariate Plots Section
Main Chemical Property vs Wine Quality
With different colors, we can add another dimension to the plot. There are 4 main features. Alcohol and volatile acidity are the top two factors that affect wine quality.

The figure looks overplotted, since wine quality takes only discrete values. We can use a jitter plot to alleviate this problem.
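Jittering adds a small amount of random noise to the discrete quality scores so that overlapping points spread apart. The sketch below uses base R's `jitter()` on the first ten rows from the `str()` output; the noise amount of 0.2 and the seed are arbitrary choices for illustration, and the original plots may have been drawn with a different plotting library.

```r
set.seed(1)  # arbitrary seed so the jitter is reproducible

# Toy data: first ten alcohol and quality values, from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Shift each score by uniform noise in [-0.2, 0.2]; the data are not changed,
# only the plotted positions.
quality_jit <- jitter(quality, amount = 0.2)

# Scatter plot with jittered quality; only drawn in an interactive session.
if (interactive()) {
  plot(alcohol, quality_jit,
       xlab = "alcohol (% by volume)", ylab = "quality (jittered)")
}
```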

We can see that higher-quality wines have higher alcohol and lower volatile acidity.
Add Another Feature
Now we add a third feature, sulphates on a log scale, and use a separate facet for each wine grade.

We can see that higher-quality wines have higher alcohol (x-axis), lower volatile acidity (y-axis) and higher sulphates (hue).
Main Chemical Properties vs Wine Quality
Since we can visualize only 3 dimensions at a time, including wine quality, two graphs are needed to show the 4 main chemical properties.

The same trend in the effect of alcohol and volatile acidity on wine quality can be observed.

We can see that higher-quality wines have higher sulphates (x-axis) and higher citric acid (y-axis).
Linear Multivariable Model
A multivariable linear model was created to predict wine quality from the chemical properties.
The features were added incrementally, based on how strongly each one correlates with wine quality.
##
## Calls:
## m1: lm(formula = quality ~ volatile.acidity, data = redwine)
## m2: lm(formula = quality ~ volatile.acidity + alcohol, data = redwine)
## m3: lm(formula = quality ~ volatile.acidity + alcohol + sulphates,
## data = redwine)
## m4: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid, data = redwine)
## m5: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid + chlorides, data = redwine)
## m6: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid + chlorides + total.sulfur.dioxide, data = redwine)
## m7: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid + chlorides + total.sulfur.dioxide + density,
## data = redwine)
##
## ======================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## ------------------------------------------------------------------------------------------------------
## (Intercept) 6.567*** 2.942*** 2.451*** 2.492*** 2.608*** 2.875*** -3.001
## (0.060) (0.190) (0.201) (0.207) (0.208) (0.214) (13.026)
## volatile.acidity -1.761*** -1.361*** -1.193*** -1.242*** -1.132*** -1.082*** -1.097***
## (0.108) (0.098) (0.100) (0.117) (0.120) (0.119) (0.124)
## alcohol 0.327*** 0.322*** 0.322*** 0.305*** 0.286*** 0.291***
## (0.016) (0.016) (0.016) (0.017) (0.017) (0.021)
## sulphates 0.694*** 0.711*** 0.886*** 0.943*** 0.935***
## (0.102) (0.105) (0.113) (0.113) (0.114)
## citric.acid -0.087 0.017 0.052 0.022
## (0.108) (0.111) (0.110) (0.128)
## chlorides -1.632*** -1.778*** -1.752***
## (0.412) (0.410) (0.414)
## total.sulfur.dioxide -0.003*** -0.003***
## (0.001) (0.001)
## density 5.854
## (12.976)
## ------------------------------------------------------------------------------------------------------
## R-squared 0.1 0.3 0.3 0.3 0.4 0.4 0.4
## adj. R-squared 0.1 0.3 0.3 0.3 0.3 0.4 0.4
## sigma 0.7 0.7 0.7 0.7 0.7 0.6 0.6
## F 267.7 366.4 266.7 200.2 164.8 143.2 122.7
## p 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -1730.3 -1553.8 -1531.2 -1530.8 -1523.0 -1511.5 -1511.4
## Deviance 857.2 681.0 661.2 660.9 654.2 644.5 644.4
## AIC 3466.6 3115.6 3072.3 3073.7 3060.0 3039.0 3040.8
## BIC 3482.6 3137.0 3099.0 3105.7 3097.4 3081.7 3088.8
## N 1534 1534 1534 1534 1534 1534 1534
## ======================================================================================================
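The incremental fitting shown in the table can be sketched with `lm()` and `update()`. The example below is only an illustration under stated assumptions: it fits the first three formulas from the table to the first ten rows printed in the `str()` output, so the coefficients will differ from the reported ones, and the comparison table is assembled by hand rather than with whatever package produced the output above.

```r
# Toy data: first ten rows of selected columns, from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Nested models: each one adds a single predictor to the previous model.
m1 <- lm(quality ~ volatile.acidity, data = wine)
m2 <- update(m1, . ~ . + alcohol)
m3 <- update(m2, . ~ . + sulphates)
models <- list(m1 = m1, m2 = m2, m3 = m3)

# Collect fit statistics for a side-by-side comparison.
comparison <- data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC)
)
print(round(comparison, 3))
```

R-squared can only stay the same or rise as predictors are added, so adjusted R-squared and AIC are the more useful columns for judging whether an extra term is worth keeping, as with `density` in m7 above, where AIC goes up rather than down.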
Multivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
Based on the multivariate analysis, five features stood out to me: alcohol, sulphates, citric acid, volatile acidity, and quality. Throughout my analysis, chlorides and residual sugar led to dead ends. However, high volatile acidity and low sulphates were a strong indicator of a bad wine. High alcohol content, low volatile acidity, higher citric acid, and higher sulphates all made for a good wine.
Were there any interesting or surprising interactions between features?
Surprisingly, other chemical properties, such as residual sugar and pH, do not have a strong correlation with wine quality.
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model
Yes, I created a linear model using seven variables: alcohol, citric acid, sulphates, volatile acidity, chlorides, total.sulfur.dioxide and density. The model was less precise in predicting qualities of 3, 4, 7, and 8, where the error was +/- 2. For qualities of 5 and 6, the majority of predictions were off by between 0.5 and 1 in either direction. The limitations of this model are obvious: I'm trying to use a linear model for data that isn't perfectly linear.
Final Plots and Summary
Plot One: Alcohol and Quality

## redwine$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.20 10.98 13.10
## --------------------------------------------------------
## redwine$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.50 10.00 10.26 10.90 14.00
## --------------------------------------------------------
## redwine$rating: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.50 10.80 11.60 11.54 12.22 14.00
Description One
This graph was interesting because it showed how excellent wines tended to have a higher alcohol content, all else being equal. By this I mean that certain preconditions had to be met for alcohol to be the predominant determinant of quality.
Plot Two: Alcohol & Sulphates vs. Quality

Description Two
Observe that lower sulphates content typically leads to a bad wine, with alcohol varying between 9% and 12%. Average wines have higher concentrations of sulphates; however, wines rated 6 tend to have both higher alcohol content and higher sulphates content. Excellent wines are mostly clustered around higher alcohol contents and higher sulphate contents.
This graph makes it fairly clear that both sulphates and alcohol content contribute to quality. One thing I found fairly interesting was that when sulphates were low, alcohol level still varied by 3%, but the wine was still rated bad. Low sulphate content appears to contribute to bad wines.
Plot Three: Volatile Acidity vs Quality

Description Three
As we can see, when volatile acidity is greater than 1, the probability of the wine being excellent is zero. When volatile acidity is between 0 and 0.3, there is roughly a 40% probability that the wine is excellent. However, when volatile acidity is between 1 and 1.2 there is an 80% chance that the wine is bad. Moreover, in this data set any wine with a volatile acidity greater than 1.4 has a 100% chance of being bad. Therefore, volatile acidity is a good predictor for bad wines.
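Probabilities of this kind can be approximated by binning volatile acidity and computing the share of each rating within each bin. The sketch below is one possible way to do it, not necessarily how the plot was produced: the bin edges are arbitrary, the rating bins are assumed to be 3-4, 5-6 and 7-8, and the data are only the first ten rows from the `str()` output, which contain no bad wines.

```r
# Toy data: first ten volatile acidity and quality values, from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
wine$rating <- cut(wine$quality, breaks = c(2, 4, 6, 8),
                   labels = c("bad", "average", "excellent"))

# Bin volatile acidity (illustrative bin edges, right-closed intervals).
wine$va_bin <- cut(wine$volatile.acidity, breaks = c(0, 0.3, 0.6, 0.9))

# Counts per bin and rating, then row-wise shares: each row sums to 1 and
# estimates P(rating | volatile acidity bin).
counts <- table(wine$va_bin, wine$rating)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

With so few rows these shares are very noisy; on the full data set, finer bins would give a smoother picture of how the chance of each rating changes with volatile acidity.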
Reflection
This analysis was conducted with the aim of uncovering hidden insights by moving one step at a time, proceeding further or backtracking depending on the outcome. At times it was hard to believe when a hypothesis turned out to be incorrect, but in hindsight it made sense. What most influenced the direction of the analysis was the patterns that emerged along the way.
The biggest struggle in this process was working through the number of iterations needed to get the results out correctly, which in itself is a very tedious process. I felt like giving up at times, but instead I decided to work through it one step at a time.
In future work, it would make sense to extend the analysis to free radicals.
The takeaways from this analysis are that high-quality wines tend to have higher alcohol content and low residual sugar. Another interesting finding was that citric acid content decreases as pH increases. So, wines with lower pH (higher acidity) have higher citric acid content.
In conclusion, if you are looking for a good bottle of wine, it will most likely have very little sweetness to it and a good amount of alcohol.